Abstract


We  have  recently  begun  work  in  machine  translation  and  felt  that  it  would  probably  make  sense  to  start  by 
surveying  the  literature  on  evaluation.  As  we  read  more  and  more  on  evaluation,  we  found  that  the  success 
of  an  evaluation  often  depends  very  strongly  on  the  selection  of  an  appropriate  application.  If  the 
application is well-chosen, then it often becomes fairly clear how the system should be evaluated.
Moreover, the evaluation is likely to make the system look good. Conversely, if the application is not
clearly identified (or worse, if the application is poorly-chosen), then it is often very difficult to find a
satisfying evaluation paradigm. We begin our discussion with a brief review of some evaluation metrics
that  have  been  tried  in  the  past  and  conclude  that  it  is  difficult  to  identify  a  satisfying  evaluation  paradigm 
that  will  make  sense  over  all  possible  applications.  It  is  probably  wise  to  identify  the  application  first,  and 
then  we  will  be  in  a  much  better  position  to  address  evaluation  questions.  The  discussion  will  then  turn  to 
the  main  point,  a  discussion  of  how  to  pick  a  good  niche  application  for  state-of-the-art  (crummy)  machine 
translation. 


1. Introduction


We have recently begun work in machine translation and felt that it would probably make sense to start by
surveying the literature on evaluation. As we read more and more on evaluation, we found that the success
of an evaluation often depends very strongly on the selection of an appropriate application. If the
application is well-chosen, then it often becomes fairly clear how the system should be evaluated.
Moreover, the evaluation is likely to make the system look good. Conversely, if the application is not
clearly identified (or worse, if the application is poorly-chosen), then it is often very difficult to find a
satisfying evaluation paradigm. We begin our discussion with a brief review of some evaluation metrics
that have been tried in the past and conclude that it is difficult to identify a satisfying evaluation paradigm
that will make sense over all possible applications. It is probably wise to identify the application first, and
then we will be in a much better position to address evaluation questions. The discussion will then turn to
the  main  point,  a  discussion  of  how  to  pick  a  good  niche  application  for  state-of-the-art  (crummy)  machine 
translation. 

Why work on machine translation now, and what kind of MT is most likely to be commercially and
theoretically profitable? Though the ALPAC report concluded in the sixties that there should be more basic
research in MT, it stated clearly that this basic research could not be justified in terms of short-term return
on investment.^1 In particular, when compared with human capabilities (still the ultimate test), MT systems
of the time were not deemed a success, and might never be.

This  belief  may  help  explain  the  resistance  of  many  MT  researchers  to  take  evaluation  questions  seriously. 
The  EUROTRA  project,  for  example,  consciously  decided  to  delay  evaluation  discussions  as  long  as 
possible: “Exact procedures for evaluation will be decided by the programme’s management committee
toward the end of each phase...” (Johnson et al., 1985, p. 168). Others argue against any human-related
evaluations  as  follows: 


“Performance of operational MT systems is usually measured in terms of their cost per 1,000 words
and their speed in pages per post-editor per hour vs. the relative cost and speed of human translation....
In my opinion, it is becoming increasingly uninformative to compare the performance of MT systems
with  that  of  human  translators,  even  though  many  organizations  tend  to  do  that  to  justify  their  MT 
investments.”  (Tucker,  1987,  p.  28) 


We  believe  that  these  attitudes  hurt  the  cause  of  MT  in  the  long  run.  As  is  proved  by  the  increasing 
availability  of  commercial  MT  and  MAT  systems  (such  as  Systran,  Fujitsu’s  Atlas,  Logos,  IBM’s  Shalt, 
and several others, for less than $100,000), MT today is beginning to find areas of real (commercial)
applicability. Thus, to the questions “Has anything changed since ALPAC? How can one build MT
systems that make a difference?”, we answer that the community needs to find evaluation measures and
applications that highlight the value of MT research in those areas where systems can be employed in a real
(and economically measurable) way. Human and machine translation show complementary strengths. In
order to design and build a theoretically and practically productive MAT system, one must choose an
application that exploits the strengths of the machine and does not compete with the strengths of the human.
This point is well put in the following:


“The question now is not whether MT (or AI, for that matter) is feasible, but in what domains it is most
likely to be effective.... The object of an evaluation is, of course, to determine whether a system permits
an adequate response to given needs and constraints.” (Lehrberger and Bourbeau, 1988, p. 192)


1. “The Committee recommends expenditures in two distinct areas. The first is computational linguistics as a part of linguistics --
studies of parsing, sentence generation, structure, semantics, statistics, and quantitative linguistic matters, including experiments in
translation, with machine aids or without. Linguistics should be supported as science, and should not be judged by any immediate
or foreseeable contribution to practical translation... The second area is improvement of [human] translation (with respect to
practical issues such as speed, cost, and quality).” (Pierce et al., 1966, p. 34)

What  then  are  appropriate  evaluation  measures?  It  would  be  nice  if  the  evaluations  were  to  identify  those 
(aspects of) MT systems that make them suitable for, and then steer them towards, high-payoff niches of
functionality. But in spite of all the literature on MT evaluation, the general evaluation measures that are
proposed often fail to pinpoint the strengths of systems and lead them toward real utility; instead, they seem
to confound important and less important aspects. Tucker’s review of Taum-Meteo and Metal, for
example, might give one the mistaken impression that both systems work about equally well (namely,
approx. 80%):^2

“Taum-Meteo has been operational since 1977, translating about five million words annually at a rate
of success of 80% without post-editing.” (Tucker, 1987, p. 31)

“[T]he Metal system is reported to have achieved between 45% and 85% ‘correct’ translations, using
an experimental base of 1,000 pages of text over the last five years.” (Tucker, 1987, p. 32)

However, these numbers do not accurately reflect the crucial difference between these two systems.
Taum-Meteo is generally regarded as a fairly complete solution to the domain-restricted task of translating
weather forecasts, whereas Metal is widely regarded as a less complete solution to the more ambitious task
of translating unrestricted text. The evaluation measure ought to be able to highlight the strengths and
weaknesses of a system. Apparently, the “success rate” measure fails to meet this requirement,
presumably because it is too vague to be of much use.^3

Unfortunately,  this  failure  seems  to  be  characteristic  of  many  of  the  task-independent  evaluation  metrics 
that  have  been  proposed  thus  far.  Since,  in  our  opinion,  the  blame  is  to  be  laid  on  the  desire  for  generality, 
we  propose  that  MT  evaluation  metrics  should  be  sensitive  to  the  intended  use  of  the  system.  In  this  paper, 
we  begin  by  outlining  metrics  that  have  been  proposed  and  end  by  concluding  that  it  becomes  crucial  to  the 
success of an MT effort to identify a high-payoff niche application so that the MT system will stand up well
to  the  evaluation,  even  though  the  system  might  produce  crummy  translations. 


2.  Traditional  Evaluation  Metrics 
2.1  System-based  Metrics 

We  identify  three  major  types  of  evaluation  metrics:  system-based,  text-based  and  cost-based.  System- 
based metrics count internal data resources such as the number of words in the lexicons, rules in the
grammars, semantic, grammatical, or lexical features, the number of representation elements in the
semantic  ontology  or  Interlingua  (if  any),  and  the  number  of  translation  rules  (if  any).  The  literature 
contains  many  examples  of  system-based  metrics,  for  instance: 


“At the moment there are about sixty subgrammars for analysis and about 900 rewriting rules in total...
number of rewriting rules for transfer and generation processes is around 800, and it will be increased in
the coming few months. The dictionary contains about 16,000 items at present, and will be increased to
100,000 items at the end of the project.” (Nagao, 1987, p. 276)


2. According to Isabelle (personal communication), Meteo currently achieves 97% success on a volume of 20 million words per year.
The increased performance is largely due to improvements in the communication system; communication noise used to be
responsible for a large percentage of the failures.

3. The success rate of 80% reported in (Isabelle, 1984, p. 265) probably should not be compared with the numbers reported for Metal.
In addition to translating the input, Meteo also attempts to determine if the translation should be checked by a professional
translator. The 80% figure reported in (Isabelle, 1984) refers to the fraction of the input that Meteo handles by itself without
assistance from a professional translator. The figures reported for Metal refer to an evaluation of the correctness of the output


An advantage of these metrics is that they are easy to measure, which makes them popular. But since these
metrics are tied to a particular system, they cannot be used very effectively for comparing two systems.
They are much more effective for calibrating system growth over time. The major disadvantage of these
metrics is that they are not necessarily related to utility.


2.2 Text-based Metrics

2.2.1 Sentence-Based Metrics. These metrics, the most common class, are applied to individual sentences
of target texts by counting, for example, the number of sentences semantically and stylistically correct, the
number of sentences semantically correct, but with odd style, the number of sentences partially
semantically correct, the number of sentences semantically and syntactically incorrect, and the number of
sentences missed altogether. A good example appears in (Nagao et al., 1986), in which sentences are
classed into one of five categories of decreasing intelligibility and into one of six categories of decreasing
accuracy. Another example is the evaluations developed to measure the results of Eurotra systems (see
Johnson et al., 1985).

Given the subjective nature of semantic, syntactic, and (especially) stylistic “correctness”, these metrics
are impossible to make precise in practice. In addition, their limitation to single sentences makes them too
simplistic (for example, it is not clear how to scale the metric when several source sentences are combined
in the target text, or when parts of them are grouped into sentences differently).


2.2.2 Comprehensibility Metrics. These metrics seek to measure translation quality by testing the user’s
comprehension of the target text as a whole. They include counting the number of texts translated well
enough for full comprehension, the number of texts in which enough could be gleaned to get a reasonably
good understanding of the content, though details may be missing, the number of texts in which some
content could be gathered, enough to tell whether the text is of interest to the user or not, the number of
texts with fatal inconsistencies or omissions, and the number of texts missed altogether.

These evaluation metrics enjoy some significant advantages. First, they can be performed by the intended
user of the translation, requiring little or no source language expertise. Second, they take in stride the mis-
or even non-translation of text due to certain relatively isolated phenomena which have proven very hard to
handle in computational systems in a general way (but which people can figure out themselves fairly
easily). A major disadvantage of these metrics is the difficulty of quantifying them. One approach to
overcome this difficulty is to create comprehension questionnaires that measure (in SAT-test-like manner)
how understandable translations are to their intended users with respect to their intended uses. An example,
using a test suite of texts, is proposed in (King and Falkedal, 1990). A second approach is to determine
how willing users would be to pay for professional translation of the text, given the translated version.
Since professional translation is expensive, the users will be motivated to identify the more useful systems.


2.2.3 Amount of Post-Editing. Metrics in this subclass are based on the amount of work required to turn the
translated text into a form indistinguishable from a human translator's effort. Ways of quantizing this
include counting the number of editing keystrokes required per page, timing the revision process per page,
and counting the percentage of machine-translated words in the final text. An example is the keystroke count
reported as follows:

“As an alternate measure of the system’s performance, one of us corrected each of the sentences in the
last three categories (different, wrong, and ungrammatical) to either the exact or the alternate category.
Counting one stroke for each letter that must be deleted and one stroke for each letter that must be
inserted, 776 strokes were needed to repair all of the decoded sentences. This compares with the 1,916
strokes required to generate all of the Hansard translations from scratch.” (Brown et al., 1990, p. 84)


Some researchers object to keystroke counting because they don’t believe that the counts are correlated
with  utility. 
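
The stroke-counting scheme in the quotation above can be sketched in a few lines of Python. This is an illustration only, not the procedure actually used by Brown et al.: it assumes that the quantity of interest is the minimal number of single-character deletions and insertions, it counts every character (including spaces) rather than letters alone, and the function names and the sentence pair are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def repair_strokes(machine_output: str, reference: str) -> int:
    """One stroke per character deleted from the machine output plus
    one stroke per character inserted to reach the reference."""
    common = lcs_length(machine_output, reference)
    deletions = len(machine_output) - common
    insertions = len(reference) - common
    return deletions + insertions


def scratch_strokes(reference: str) -> int:
    """Strokes needed to type the reference from nothing (assumed: one per character)."""
    return len(reference)


if __name__ == "__main__":
    # Made-up example pair, not taken from the Hansard data.
    mt, ref = "the cat sat", "the cat sits"
    print(repair_strokes(mt, ref), "repair strokes vs.", scratch_strokes(ref), "from scratch")
```

On the figures quoted above, repair took 776 of the 1,916 from-scratch strokes, or roughly 40%. Whether such a ratio tracks the real effort of a post-editor is exactly what the objection in the preceding paragraph calls into question.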


2.3 Cost-based Measures

The third major type of metric concentrates on the system’s efficiency in producing a translation, as in:

1. cost per page of acceptable translation (machine, human, or mixed),

2. time per page of acceptable translation (machine, human, or mixed).

One such evaluation was done on Taum-Aviation (Isabelle and Bourbeau, 1985):


Task                            Machine    Human

Preparation / input             $0.014     $0.000
Translation                     $0.079     $0.100
Human revision                  $0.068     $0.030
Transcription / proofreading    $0.022     $0.015

Total                           $0.183     $0.145

The problem with cost-based metrics is that they often don’t make the systems look very good. As can be
noted from the table above, the evaluation shows that Taum-Aviation is actually more expensive than
human translation (HT). If one wants the system to look good, it is important to pick a good niche
application.
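
The comparison in the table can be checked with a little arithmetic. The Python sketch below is only an illustration: the variable names are hypothetical, and the figures are read off the table as decimal dollar amounts in whatever unit the original evaluation used. It sums the per-task costs and computes how much more the machine chain costs than the human one.

```python
# Per-task costs from the Taum-Aviation table: (machine, human), in dollars.
costs = {
    "Preparation / input": (0.014, 0.000),
    "Translation": (0.079, 0.100),
    "Human revision": (0.068, 0.030),
    "Transcription / proofreading": (0.022, 0.015),
}

# Round to three decimals to avoid floating-point noise in the sums.
machine_total = round(sum(machine for machine, _ in costs.values()), 3)
human_total = round(sum(human for _, human in costs.values()), 3)

# Relative extra cost of the machine chain over human translation.
overhead = machine_total / human_total - 1.0

# Task with the largest machine-minus-human cost difference.
worst_task = max(costs, key=lambda task: costs[task][0] - costs[task][1])

print(f"machine: ${machine_total:.3f}  human: ${human_total:.3f}")
print(f"machine chain costs {overhead:.0%} more; largest gap: {worst_task}")
```

Under this reading the totals match the table ($0.183 and $0.145), the machine chain comes out roughly a quarter more expensive, and the largest single gap is in the human revision step. The machine is cheaper only on the translation step itself.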

Some  might  accuse  us  of  “lying  with  statistics”.  There  is  a  fine  line  between  realism  and  fraud.  We 
would  say  that  it  is  realistic  to  pick  an  “easy”  niche  application  if  the  application  is  likely  to  have  real 
value  (e.g.,  Meteo  (Isabelle,  1984)).  On  the  other  hand,  the  strategy  does  run  the  risk  of  raising 
expectations  unrealistically  if  the  application  only  appears  to  have  real  value  (e.g.,  the  original  GU 
experiment  (see  section  7.1)).  Of  course,  we  would  want  to  concentrate  our  efforts  on  good  niche 
applications  that  really  do  have  value  and  avoid  the  bad  ones  that  look  like  they  ought  to  scale  up  to 
something  useful  but  actually  don't. 


3.  Characteristics  of  a  Good  Niche 

We believe a good niche application should meet as many of the following desiderata as possible:

a.  it  should  set  reasonable  expectations, 

b.  it  should  make  sense  economically, 

c.  it  should  be  attractive  to  the  intended  users, 

d.  it  should  exploit  the  strengths  of  the  machine  and  not  compete  with  the  strengths  of  the  human, 

e. it should be clear to the users what the system can and cannot do, and

f. it should encourage the field to move forward toward a sensible long-term goal.^4


4. Extensive Post-Editing (EPE): An Inappropriate Niche

It is not easy to identify a good niche application. One cannot simply take a state-of-the-art MT program
and  give  it  to  a  bunch  of  salesmen  and  expect  a  miracle.  One  has  to  find  an  application  that  makes  sense. 

The extensive post-editing (EPE) application would appear to be a natural way to get value out of a
state-of-the-art MT system. But unfortunately, the application fails to meet most of the desiderata proposed
above. 

4.1  (a)  Realistic  Expectations 

One  can  find  claims  that  EPE  either  increases  or  decreases  productivity  by  anywhere  from  a  factor  of  1  to  2 
or 2 to 1. No matter what the truth is, the application would probably be more successful in the
marketplace if expectations were more realistic. One rarely finds disclaimers of the form “your mileage
may vary” after some of the claims that have been made on behalf of the EPE application:


“Although  you  can  expect  to  at  least  double  your  translator's  output,  the  real  cost-saving  in  MT  lies  in 
complete  electronic  transfer  of  information  and  the  integration  into  a  fully  electronic  publishing 
system.” (Magnusson-Murray, 1985, p. 180)

“Substantial  rises  in  translations  output,  by  as  much  as  75  per  cent  in  one  case,  are  being  reported  by 
users  of  the  Logos  machine  translation  (MT)  system  after  only  a  few  months.”  (Lawson,  1984,  p.  6) 


“For  one  type  of  text  (data  description  manuals),  we  observed  an  increase  in  throughput  of  30  per 
cent.”  (Tschira,  1985) 


Statements such as these run the risk of setting unrealistic expectations, and consequently, in the long run, it
is possible that they could actually do more harm than good. (We discuss the dangers of unrealistic
expectations in section 7.) If users could really expect even modest gains in productivity, then one would
have expected the EPE products that have been offered by ALPS, Logos, Systran, Weidner and others
to be more successful in the marketplace than they have been.


4.2 (b) Cost Effectiveness

In fact, we were rather surprised to discover that there have been a number of trials indicating that EPE
might actually be more expensive than human translation (HT). For instance, Van Slype (1979) estimated
that EPE costs 475 BFrs. per 100 words, almost twice as much as HT (150-250 BFrs. per 100 words).
Similarly, the Canadian government found more or less the same result in their trial of the Weidner
product:


“[T]he HT production chain was significantly faster than the MT production chain. How much faster
depends on which phases of the MT chain are counted. If we count all the steps on the log form, human
translation was nearly twice as fast as machine translation. If we discount the time that the machine
actually takes to translate (on the assumption that the participants could use this time to do other useful
tasks), as well as the time for the second dictionary update (on the grounds that these new or modified
entries are not intended for the current text), MT remains 27% slower than HT. If, in addition, we
discount the time for text entry, assuming that source texts arrive in machine readable form that
Weidner could import, MT still remains 5% slower than HT for all the texts translated during the
operational phase of the trial.” (Macklovitch, 1991, p. 3)


4. Many long-term goals have been proposed over the years; FAHQT (fully-automatic high-quality translation) (Bar-Hillel, 1960, p.
94) is perhaps one of the more well-known proposals.

In fact, there have been questions about the cost effectiveness of the EPE application dating back to the
ALPAC report, well before many of these products were introduced into the marketplace:^5


“The  postedited  translation  took  slightly  longer  to  do  and  was  more  expensive  than  conventional 
human  translation...  Dr.  J.  C.  R.  Licklider  of  IBM  and  Dr.  Paul  Garvin  of  Bunker-Ramo  said  they 
would not advise their companies to establish such a service.” (Pierce et al., 1966, p. 19)


It is difficult to know how to balance the results of these government trials against some of the testimonials
cited above. It is probably the case that EPE saves time and money in some applications, and hurts in
others. No matter what the facts are, though, it is almost certainly the case that the field would be in better
shape if expectations were better handled. It would be most unfortunate if a potential user were to buy into
EPE, expecting to save a bundle, only to discover the hard way that it may not be cost effective in his or her
particular application. We were rather surprised to discover that EPE could actually be slower than HT, but
after thinking about it for a little while, it should have been obvious that it can take longer to fix a badly
written piece of prose than it would take to start from scratch.


4.3 (c) Attractiveness to Intended Users

EPE  has  failed  to  gain  much  acceptance  among  the  intended  target  audience  of  professional  translators, 
because post-editing turns out to be an extremely boring, tedious and unrewarding chore.^6


“Most  of  the  translators  found  postediting  tedious  and  even  frustrating.  In  particular,  they  complained 
of  the  contorted  syntax  produced  by  the  machine.  Other  complaints  concerned  the  excessive  number  of 
lexical alternatives provided and the amount of time required to make purely mechanical revisions.”
(Pierce et al., 1966, p. 96)

“Many, but not all, translators decided, after the first phase of the MT experiment, that Systran was not
a translation aid, because they found that it took too long, and was too tedious, to convert raw MT into a
translation ‘to which they would be prepared to put their name.’” (Wagner, 1985, p. 203)

“When  asked  by  the  consultant  if  they  would  like  to  continue  working  with  Weidner  on  the  same  texts 
after  the  end  of  the  trial,  not  a  single  participant  accepted.”  (Macklovitch,  1991,  p.  4) 

After reading Macklovitch’s description of some of the errors in (Macklovitch, 1986), one can easily
appreciate why some of the translators would be frustrated with the post-editing task. Macklovitch
observed that approximately half of the errors in one sample involved the overuse of French articles. In
translating an English noun phrase into French, it is a pretty good bet that the French noun phrase should
begin with an article even if there isn’t one in English. However, this rule does not hold in tables, where
the French use of articles is apparently somewhat more like English. As it happened, one of the texts used
in the trial contained a very long list of crop varieties published by Agriculture Canada, most of which
should not have been translated with an article. Unfortunately, the Weidner system did not know that noun
phrases work differently in tables, and consequently, the post-editor was faced with the rather tedious task
of deleting the article and adjusting the capitalization for each of the crop varieties in this very long list.
The professional translator probably would have found it quicker and more rewarding to translate the list
from scratch.


5. The cost effectiveness of the EPE application is discussed in more detail in Appendix 14 of the ALPAC report. The appendix
observed that postediting tends to “impede the rapid translators and assist the slow translators” (Pierce et al., 1966, p. 94). This
would suggest that EPE products might be more appropriate for casual use by an amateur rather than daily use by a professional.

6. Perhaps the task would be less tedious if the user interface were made more flexible and more user-friendly.

4.4  Kay’s  Characterization  of  EPE 

One  can  continue  to  go  through  the  list  of  desiderata  proposed  above  and  find  even  more  reasons  why  EPE 
is  an  inappropriate  niche.  Rather  than  beat  a  dead  horse  ourselves,  we  thought  we  would  let  Martin  Kay  do 
it  for  us,  as  only  he  can: 


“There was a long period -- for all I know, it is not yet over -- in which the following comedy was
acted out nightly in the bowels of an American government office with the aim of rendering foreign
texts into English. Passages of innocent prose on which it was desired to effect this delicate and
complex operation were subjected to a process of vivisection at the hands of an uncomprehending
electronic monster that transformed them into stammering streams of verbal wreckage. These were
then placed into only slightly more gentle hands for repair. But the damage had been done. Simple
tools that would have done so much to make the repair work easier and more effective were not to be
had, presumably because of the voracious appetite of the monster, which left no resources for anything
else. In fact, such remedies as could be brought to the tortured remains of these texts were administered
with colored pencils on paper and the final copy was produced by the action of human fingers on the
keys of a typewriter. In short, one step was singled out of a fairly long and complex process at which to
perpetrate automation. The step chosen was by far the least well understood and quite obviously the
least apt for this kind of treatment.” (Kay, 1980, “The Proper Place of Men and Machines in Language
Translation,” p. 2)


5. A Constructive Suggestion: The Workstation Approach

Having established that EPE is inappropriate, Kay then suggested a workstation approach. At first the
workstation might do little more than provide word-processing functionality, dictionary access and so on,
but as time goes on, one might imagine functionality that begins to look more and more like machine
translation. 


“I  come  now  to  my  proposal.  I  want  to  advocate  an  incremental  approach  to  the  problem  of  how 
machines  should  be  used  in  language  translation.  The  word  approach  can  be  taken  in  its  original 
meaning as well as the one that has become so popular in modern technical jargon. I want to advocate a
view of the problem in which machines are gradually, almost imperceptibly, allowed to take over
certain functions in the overall translation process. First they will take over functions not essentially
related to translation. Then, little by little, they will approach translation itself. The keynote will be
modesty. At each stage, we will do only what we know we can do reliably. Little steps for little feet!”
(Kay, 1980, p. 11)


In his concluding remarks, Kay expressed the hope that his approach be implemented by someone with
enough “taste” to be realistic and pragmatic.


“The translator’s amanuensis [workstation] will not run before it can walk. It will be called on only for
that for which its masters have learned to trust it. It will not require constant infusions of new ad hoc
devices that only expensive vendors can supply. It is a framework that will gracefully accommodate the
future contributions that linguistics and computer science are able to make. One day it will be built
because its very modesty assures its success. It is to be hoped that it will be built with taste by people
who understand languages and computers well enough to know how little it is that they know.” (Kay,
1980, p. 20)


In fact, Kay’s approach has recently been implemented by people who understand the practical realities
well enough to take an even more modest approach than Kay himself probably would have taken. CWARC
(Canadian Workplace Automation Research Center) has undertaken to provide the Canadian government’s
Translation Bureau with a translator’s workstation that could be deployed in the near term to the bureau's
900 full-time translators (Macklovitch, 1989). For obvious pragmatic considerations, they have decided to
use the following off-the-shelf components:


a. a PC/AT,

b. network access to the Termium terminology database on CD-ROM,

c. WordPerfect, a text editor,

d. CompareRite, a program for comparing two versions of a text file,

e. TextSearch, a program for making concordances and counting word frequencies,

f. Mercury/Termex, a program for maintaining a private terminology database,

g. Procomm, a program providing remote access to data banks via a telephone modem,

h. Seconde Memoire, a program that deals with French verb conjugations, and

i.  Software  Bridge,  a  program  for  converting  word  processing  files  from  one  commercial  format  into 
another. 


This is clearly a sensible starting point for introducing technology into the translator’s workplace. They
will hopefully be able to demonstrate that the PC-based workstation is clearly superior to dictation
machines. After they have achieved a track record of success and the new technology has been in place for
a while, they will be in a much better position to introduce additional tools, which might be more exciting
to us, but also more risky for the managers at the translation bureau.

One might imagine all kinds of exciting tools. For example, the workstation could have a "complete" key,
like control-space in Emacs, which would fill in the rest of a partially typed word/phrase from context.
One might take this idea a step further and imagine that it ought to be possible to build a super-fast typewriter
that would be able to correct typos and fill in context given relatively few keystrokes. Peter Brown
(personal communication) once remarked that such a super-fast typewriter ought to be possible in the
monolingual case, observing that there is so much redundancy in language that the user should only have to
type a few characters per word, or about the equivalent of 1.25 bits per character (Shannon, 1951),⁷ which
is only slightly more than a byte (ascii character) per English word on average. The user should have to
type even less in the bilingual case because the source language should provide quite a number of
additional bits of information.
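
The arithmetic behind the bits-per-word figure can be made concrete with a small sketch. It multiplies a per-character entropy by an average word length. The word length of 6.5 characters (counting the trailing space) is our assumption for illustration and is not taken from the sources cited; shorter averages give proportionally fewer bits. The function name is likewise hypothetical.

```python
def bits_per_word(bits_per_char: float, chars_per_word: float) -> float:
    """Information needed to specify one word, given a per-character entropy."""
    return bits_per_char * chars_per_word


# Assumed average word length, counting the trailing space (illustrative only).
CHARS_PER_WORD = 6.5

# Per-character entropy estimates quoted in the text and its footnote.
ESTIMATES = [("Shannon (1951)", 1.25), ("Brown et al. (1991)", 1.76)]

for label, entropy in ESTIMATES:
    bits = bits_per_word(entropy, CHARS_PER_WORD)
    print(f"{label}: {bits:.2f} bits/word = {bits / 8:.2f} bytes/word")
```

Under this assumption the lower estimate comes out close to one byte per word, and the higher estimate raises the budget by roughly 40 percent.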


7. Shannon's estimate that English has an entropy of 1.25 bits per character is probably too optimistic. In practice, one would
probably expect a practical system to have an entropy somewhat closer to 1.76 bits per character (Brown et al., 1991).

The super-fast typewriter may still be a ways off, but we are almost already in a position to provide some
very useful but less ambitious facilities. In particular, the Translation Bureau currently spends a lot of
resources retranslating minor revisions of previously translated materials (e.g., annual reports that generally
don’t change much year after year). It would be very useful if there were some standard tools for archiving
and retrieving previously translated texts so that the translators would have access to the previous
translations, when appropriate. It is also becoming possible to use bilingual concordances to help with
terminological issues.
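
A minimal sketch of what such an archive-and-retrieve tool might look like is given below. It is not a description of any tool used at the Translation Bureau: the archive format (a list of source/translation sentence pairs), the function name, the similarity threshold, and the use of Python's standard `difflib` string matcher are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_previous_translation(sentence, archive, threshold=0.8):
    """Return the (source, translation) pair closest to `sentence`, or None.

    `archive` is a list of (source_sentence, translation) pairs. Similarity is
    the character-level ratio from difflib; 0.8 is an arbitrary cut-off.
    """
    best_pair, best_score = None, threshold
    for source, translation in archive:
        score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if score >= best_score:
            best_pair, best_score = (source, translation), score
    return best_pair


# Hypothetical archive containing last year's report.
archive = [
    ("The annual report for 1989 is attached.",
     "Le rapport annuel de 1989 est joint."),
]

# A minor revision of a previously translated sentence retrieves the old pair.
print(best_previous_translation("The annual report for 1990 is attached.", archive))
```

The translator would still revise the retrieved translation by hand; the tool only saves the effort of starting from scratch.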

The workstation application stands up to the six desiderata proposed in section 3 much better than the EPE
application. It is (a) much more realistic, so it should have a better chance of (b) economic success. After
all, it ought to be able to beat dictation machines, at least in many cases. In addition, it has a better chance
of (c) being attractive to the intended users and (d) exploiting the strengths of the machine as well as those
of the human since it is being developed and tested by professional translators at the request of a translation
organization. Since it is so modest it should be (e) fairly clear what it can and cannot do. Finally, there is a
(f) clear path toward a desirable long-term goal, since the strategy explicitly calls for more and more
ambitious tools as time goes on.


6.  Another  Constructive  Suggestion:  Appeal  to  the  End-User 

The workstation approach is an attempt to appeal to the professional translators; it uses the benefits of
office-automation as a way to sneak technology into the translator’s workplace. An alternative approach,
which also seems promising to us, is to use the speed advantages of raw (or almost raw) MT to appeal to
the end-user who may not require high quality.

6.1  Rapid  Post-Editing 

After noting that translators were unlikely to support the EPE application because they are unlikely to
choose MT over HT, Wagner found that end-users would often opt for crummy quick-and-dirty translation,
if they were given a choice.

"We therefore decided to use Systran in a different way - to provide a faster translation service for
those translation users who wanted it, and were willing to accept lower-quality translation." (Wagner,
1985, p. 203)

The output from Systran was passed through a ‘rapid post-editing’ service that emphasized speed (4-5
pages per hour) over quality. When the project was first presented to the translation staff, it was
well-received and 13 out of 35 volunteered to offer the rapid post-editing service on the understanding that they
could opt out if they did not enjoy it. Wagner found that “the option is popular with a number of users and
perhaps surprisingly, welcomed with some enthusiasm by CEC [Commission of the European
Communities] translators who find rapid post-editing an interesting challenge” (Hutchins, 1986, p. 261).

Wagner’s rapid post-editing service is a much better application of crummy MT than EPE because it gives
all parties a choice. Both the users and the translators are more likely to accept the new technology, warts
and all, if they are given the choice to go back and do things the old-fashioned way. The trick to being able
to capitalize on the speed of raw MT is to persuade both the translators and the end-users to accept lower
quality. Apparently the end-users are more easily convinced than the translators, and therefore, for this
approach to fly, it is important that the end-users be in the position to choose between speed and quality.

6.2 No Post-Editing

The Georgetown system was used extensively at the EURATOM Research Center in Ispra, Italy, and the
Atomic Energy Commission’s Oak Ridge National Laboratory from 1963 until 1973. Translations were
delivered without pre-editing or post-editing. In 1972-1973, Bozena Henisz-Dostert (now Bozena
Thompson) conducted an evaluation and concluded that users were quite happy with raw MT:

"The users presented a rather satisfied group of customers, since 96 percent of them had or would
recommend machine-translation services to their colleagues, even though the texts were said to require
almost twice as much time to read as original English texts (humanly-translated texts also were judged
to take longer to read, but only about a third longer), and that machine-translated texts were said to be
21 percent unintelligible. In spite of slower service than desired and a high demand on reading time,
machine translation was preferred to human translation by 87 percent of the respondents if the latter
took three times as long as the former. The reasons for the preference were not only earlier access, but
also the feelings that the ‘machine is more honest’, and that since human labor is not invested it is easy
to discard a text which proves of marginal interest. Getting used to reading machine-translation style
did not present a problem as evidenced by the answers of over 95 percent of the respondents."
(Henisz-Dostert, 1979, p. 206)


It is also interesting to compare the attitudes of the users of this service with the attitudes of the translators
mentioned above. Henisz-Dostert found that end-users were generally quite supportive, and would
recommend the service to a friend, whereas Macklovitch found that professional translators were generally
unwilling to continue using the service themselves, let alone recommend the service to a friend.


“A grateful word is in order on the users’ attitudes, who were most cooperative and friendly, and
interested in what was involved in machine translation. They showed their familiarity with the
aberrations of the texts, some of which were considered quite amusing ‘classics’, e.g., ‘waterfalls’
instead of ‘cascades’ (the users asked that this not be changed!). Very commonly, and understandably,
they were interested in improvements and offered many suggestions. An example of an extreme
attitude on the part of one user in this respect was that of ‘cheating’ on the questionnaire by giving less
positive answers than in oral discussions. When subsequently asked about this, he reacted with
something like: ‘I use it so much, I want you to improve it, and if I show that I am satisfied, you will
not work on it any more.’” (Henisz-Dostert, 1979, p. 151)


Why  are  these  users  so  much  more  satisfied  with  MT  than  the  translators  involved  in  the  Canadian 
government’s  trial  of  Weidner?  We  believe  the  difference  is  the  application.  It  makes  sense  to  offer  end- 
users  the  option  to  trade  off  speed  for  quality,  whereas  it  does  not  make  sense  to  try  to  force  translators  to 
become  post-editors.  Consider  the  example  of  the  erty  vmieties  mentioned  above.  Many  end-users  might 
not  be  bothered  too  much  by  the  extra  articles  because  they  can  quickly  skim  past  the  mistakes,  but  the 
professional  translator  might  feel  quite  differently  about  the  extra  articles  because  he  or  she  will  have  to  fix
them. 


6.3 Even More Modest Attempts to Appeal to the End-User

Consider, for example, the problem of reading email from other countries. The first author currently
receives several messages a day in French such as the following:⁸


Pour repondre aux questions de Maurizio LANA, j’ai entendu dire de bonnes choses concernant le
programme ALPS de Alan MELBY. C’est au moins le nom de sa societe (ALPS) qui se trouve a Provo
ou a Orem (Utah, USA). Il est egalement professeur de linguistique a la Brigham Young University
(Provo, Utah).


8. These messages usually arrive without accents.


It might be possible to provide a tool to help recipients whose French is not very good. Imagine that the
email reader had a “Cliff-note” mode that would gloss many of the content words with an English
equivalent:

Pour repondre aux questions de Maurizio LANA,
     answer       questions

j'ai entendu dire de bonnes choses concernant
     heard   say     good   things concerning

Cliff-note mode could be used as a way to sneak technology into the email reader, just as Kay’s workstation
approach is a way of sneaking technology into the translator’s workplace. At first, Cliff-note mode would
do little more than table lookup, but as time goes on, it might begin to look more and more like machine
translation. In the future, for example, the system might be able to gloss the phrase le nom de sa societe as
the name of his company, but currently the system would gloss nom as behalf (as in au nom de), and societe
as society, because these senses happen to be more common in the Canadian Hansards (parliamentary
debates), which were used to train the system. Obviously, the results would be much improved if we
started with a more representative sample of general language, but nevertheless, even these results may be
useful, at least for users whose French is sufficiently weak.
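
The table-lookup version of Cliff-note mode can be sketched in a few lines. This is an illustration under stated assumptions, not the system described above: the function names are hypothetical, and the sense counts are invented stand-ins for frequencies that would in practice be estimated from a bilingual corpus such as the Hansards. Words missing from the table (function words, proper names) are simply left unglossed.

```python
import re

# Invented sense counts for illustration only; a real table would be
# estimated from aligned bilingual text.
SENSE_COUNTS = {
    "repondre": {"answer": 10},
    "questions": {"questions": 10},
    "entendu": {"heard": 10},
    "dire": {"say": 10},
    "bonnes": {"good": 10},
    "choses": {"things": 10},
    "concernant": {"concerning": 10},
    "nom": {"behalf": 7, "name": 3},
    "societe": {"society": 6, "company": 2},
}


def gloss_word(word, sense_counts=SENSE_COUNTS):
    """Return the most frequent English sense of `word`, or None if unlisted."""
    senses = sense_counts.get(word.lower())
    if not senses:
        return None
    return max(senses, key=senses.get)


def cliff_note(line, sense_counts=SENSE_COUNTS):
    """Pair each word of an (unaccented) French line with its gloss, if any."""
    words = re.findall(r"[A-Za-z']+", line)
    return [(word, gloss_word(word, sense_counts)) for word in words]


for word, gloss in cliff_note("le nom de sa societe"):
    print(f"{word:10s} {gloss or ''}")
```

Because each word is glossed independently with its most frequent sense, the sketch reproduces the weakness noted above: nom comes out as behalf and societe as society, whatever the context.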

Cliff-note mode stands up fairly well to the six desiderata. (a) It sets reasonable expectations. (b) It doesn't
cost much to run. (c) It ought to be attractive to users. After all, those who don't like it don’t have to use
it. (d) It is well-positioned to integrate the strengths of the machine (vocabulary) without competing with
the strengths of the user (knowledge of function words, syntax and domain constraints). (e) It is so simple
that users shouldn’t have any trouble appreciating both the strengths as well as the weaknesses of the
word-for-word approach. Finally, (f) the strategy of gradually introducing more and more technology is
ideally suited for advancing the field toward desirable long-term goals.

Perhaps it is already the case that the field can deliver much more than cliff-note mode. If so, after
having bought into cliff-note mode, the marketplace would be well-positioned to appreciate these
improvements.

It  may  seem  perverse  to  suggest  that  we  should  try  to  deliver  much  less  than  the  state-of-the-art.  However, 
in  the  near  term,  one  probably  cannot  deliver  a  small,  reliable,  easy-to-use,  inexpensive  MT  system  with 
broad  coverage  that  would  be  able  to  do  much  better  than  cliff-note  mode.  It  is  probably  better  to  do 
something  modest,  than  try  to  do  too  much  and  end  up  accomplishing  too  little. 


7. Conclusion

We  have  identified  six  desiderata  for  a  good  niche  application.  Two  marketing  strategies  appear  to  meet 
these  six  desiderata  fairly  well: 

1. use the benefits of office-automation to sell to the professional translator, or

2. use the speed advantages of raw (or almost raw) MT to sell to the end-user who may not require
high quality.⁹

9. Other possibilities have also been successful in the past. Xerox, for example, has obtained impressive results by introducing a
restricted language into the document preparation organization (Hutchins, 1986, p. 294). Smart Systems has also exploited the use
of a restricted language in organizations that generate text. Limiting the domain is another formula for success. The classic
example is Meteo (Isabelle, 1984). Unfortunately, however, it is very hard to find very many other naturally-occurring limited
domains that people care about, and consequently, this strategy is unlikely to be repeated very many times in the future.


The discussion has stressed pragmatism throughout. The speech processing community, for example,
has been somewhat more successful recently in making it possible to report crummy results. It is
now quite acceptable in the speech community to work on very restricted domains (e.g., spoken
digits, resource management (RM), airline traffic information systems (ATIS)) and to report
performance that doesn’t compare with what people can do. No one would even suggest that a
machine should be able to recognize digits as well as a person could. Because the field has taken a
more realistic approach, the field now has a fairly good public image, and appears to be making
progress at a reasonable rate:

“Slowly but surely, the technology is making its way into the real world.” (Schwartz, 1991,
Business Week, p. 130)


But there was a time when speech researchers were much more ambitious. According to Klatt's
review (Klatt, 1977), the first ARPA Speech Understanding project (Newell et al., 1973) had the
objective of obtaining a breakthrough in speech understanding capability that could then be used
toward the development of practical man-machine communication systems. Even though Harpy
(Lowerre and Reddy, 1980) did in fact exceed the specific goals of the project (e.g., accept
thousand-word-vocabulary connected speech with an artificial syntax and semantics and produce less than 10%
semantic error in a few times real time on a 100 mips machine), it didn’t matter because Harpy had
failed to obtain the anticipated breakthrough. And consequently, funding in speech recognition and
understanding was dramatically reduced over the following decade. When activity was eventually
resumed many years later, the community had learned that it is ok to strive toward realistic goals, and
that it can be dangerous to talk about breakthroughs.

7.1 The GU Experiment

The experience in machine translation is perhaps even more sobering. The 1954 Georgetown University
(GU) experiment was a classic example of a success catastrophe. In Zarechnak’s 1979 review of early
work on machine translation, he recalled that the GU experiment was originally seen as a huge advance:

“The result of GU machine translation was given wide publicity in 1954 when it was announced in
New York. The announcement was greeted by astonishment and skepticism among some people. L. E.
Dostert summarized the result of the experiment as being an authentic machine translation which does
not require pre-editing of the input nor post-editing of the output.” (Zarechnak, 1979, p. 28)

But now, we can look back and see that the 1954 GU experiment probably did more harm than good by
setting expectations at such an unrealistic level that they could probably never be met. Ten years after the
GU experiment, the ALPAC report compared four then-current systems with the earlier GU experiment and
suggested that there had not been much progress.

“The reader will find it instructive to compare the samples above with the results obtained on simple, or
selected, text 10 years earlier (the Georgetown-IBM Experiment, January 7, 1954) in that the earlier
samples are more readable than the later ones.” (Pierce et al., 1966)


Zarechnak, a member of the Georgetown effort, complained rather bitterly that the comparison was unfair.
In reality, the 1954 GU experiment had been a canned demo of the worst kind, whereas the four systems
developed during the 1960s were intended to handle large quantities of previously unseen text.

“When ten years later a text of one hundred thousand words was translated on a computer without
being previously examined, one would expect a certain number of errors on all levels of operations, and
the need for post-editing. The small text in 1954 has no such random data to translate.” (Zarechnak,
1979, p. 56)


In fact, the ALPAC committee had also appreciated the "toy"-ish aspects of the 1954 GU experiment, but
they did not feel that that was an adequate excuse. They criticized both the 1954 experiment as well as the
four systems in question, the former for setting expectations unrealistically high, and the latter for failing to
meet those expectations, unrealistic as they may be.

"The development of the electronic digital computer quickly suggested that machine translation might
be possible. The idea captured the imagination of scholars and administrators. The practical goal was
simple: to go from machine-readable foreign technical text to useful English text, accurate, readable,
and ultimately indistinguishable from text written by an American scientist. Early machine translations
of simple or selected text, such as those given above, were as deceptively encouraging as 'machine
translations’ of general scientific text have been uniformly discouraging." (Pierce et al., 1966, pp. 23-24)


If expectations had been properly managed and the waters had not been poisoned by the 1954 GU
experiment, it is possible that we would now look back on the MT effort during the 1960s from a much
more positive perspective. In fact, one of the four systems in question later became known as Systran, and
is still in wide use today. In this sense, early work on MT was much more successful than early work on
Speech Understanding; the first ARPA Speech Understanding Project did not produce any systems with the
same longevity as Systran.

For some reason that is difficult to understand, the two fields currently have entirely different public
images; on the one hand, the layman can readily recognize that it is extremely difficult for a machine to
recognize speech, while, on the other hand, even the manager of a translation service will blindly accept the
most preposterous pretensions of practically any MT salesman. Perhaps we can change this perception if
we succeed in focusing our attention on good applications of state-of-the-art (i.e., crummy) machine
translation.